The TDT-3 Text and Speech Corpus

Authors

  • David Graff
  • Chris Cieri
  • Stephanie Strassel
  • Nii Martey
Abstract

The TDT-3 Text and Speech Corpus expands on previous phases of Topic Detection and Tracking data collections by increasing the number of news sources being sampled, by including Mandarin Chinese as well as English news data, and by introducing new forms of topic annotation. In order to satisfy the specific data and annotation requirements of the TDT-3 Evaluation Plan[1], the LDC refined and supplemented the methods that had been used in TDT-2 corpus development[2]. There were significant changes and improvements in the process of selecting and defining target topics, in the procedures for quality assurance applied to both data content and annotations, and in the organization of the delivered corpus. In addition, the LDC created or acquired a range of resources to support research in cross-language information retrieval. These included the addition of a Mandarin Chinese component to the TDT-2 Text and Speech Corpus, the collection of a large body of Chinese-English parallel text, and the adaptation of Chinese-to-English and English-to-Chinese glossing lexicons. All the resources that we have developed for use by the participants in the TDT-3 Evaluation are being added to the LDC’s catalog of corpora for general availability.

Related resources

The TDT-2 Text and Speech Corpus

This paper describes the creation and content of the TDT-2 corpus in the context of the TDT-2 research project it supports and in comparison to previous and subsequent efforts.

Improved spoken document retrieval by exploring extra acoustic and linguistic cues

In this paper, we explored the use of various kinds of extra information to improve the performance of spoken document retrieval (SDR). From the speech recognition perspective, we incorporated the acoustic stress and word confusion information into the audio indexing. From the linguistic perspective, we applied the part-of-speech information in both the audio indexing and the query representation. From t...

The A'laam Corpus: A Standard Named-Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named...

Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies

This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out ...

Transliteration of Proper Names in Cross-Lingual Information Retrieval

We address the problem of transliterating English names using Chinese orthography in support of cross-lingual speech and text processing applications. We demonstrate the application of statistical machine translation techniques to “translate” the phonemic representation of an English name, obtained by using an automatic text-to-speech system, to a sequence of initials and finals, commonly used ...

Publication date: 1999